library(tidyverse)
library(rethinking)
library(dagitty)
library(knitr)
library("ggdag")
library("ggrepel")
Epidemiology,5th edition, Leon Gordis
Causal Inference: What If, by Miguel A. Hernán, James M. Robins
Causal inference in statistics: an overview, Pearl 2009
An introduction to causal inference, Pearl 2010
https://malco.io/2019/09/17/tidy-causal-dags-with-ggdag-0-2-0/
https://cran.r-project.org/web/packages/ggdag/vignettes/intro-to-ggdag.html
https://cran.r-project.org/web/packages/ggdag/vignettes/bias-structures.html
Types of Studies Experimental Clinical Trials Intervention Trials Prevention Trials Field Trials Observational Cross-sectional studies Cohort Studies (retrospective and prospective) Case-control Studies (including “nested”) Matched Case-control studies Ecological studies
| Disease + | Disease - | Total | |
|---|---|---|---|
| Test + | a = P[T(+) & D(+)] = true positive | b = P[T(+) & D(-)] = false positive | a+b = P[T(+)] = test positive |
| Test - | c = P[T(-) & D(+)] = false negative | d = P[T(-) & D(-)] = true negative | c+d = P[T(-)] = test positive |
| Total | a+c = P[D(+)] = disease positive | b+d = P[D(-)] = disease negative | 1=a+b+c+d = total |
\[P(T^+|D^+)= \frac{P(T^+ \cap D^+)}{P(D^+)}\]
\[P(T^-|D^-)= \frac{P(T^- \cap D^-)}{P(D^-)}\]
\[\frac{P(T^-|D^+)}{P(T^-|D^-)}= \frac{1-P(T^+ | D^+)}{P(T^- | D^-)}= \frac{1-sensitivity}{specificity}\]
\[\frac{P(T^+|D^+)}{P(T^+|D^-)}= \frac{P(T^+ | D^+)}{1-P(T^- | D^-)}= \frac{sensitivity}{1-specificity}\]
\[P(D^+|T^-)= \frac{P(T^+ \cap D^+)}{P(T^+)}\]
\[P(D^-|T^-)= \frac{P(T^- \cap D^-)}{P(T^+)}\]
Sensitivity and specificity Predictive value positive and predictive value negative Likelihood ratios (binary, ordinal and quantitative tests) Comparison of sensitivity and specificity of 2 tests Prevalence/apparent prevalence relationship Sensitivity, specificity and predictive values of tests in series and parallel Kappa for interobserver agreement ROC curves
FOR PROJECT:sensitivity /spp of DENV PCR test and of surveillance method. Validation of PCR test How do you define dengue exposure? What’s the PRNT cut-off?
Proportion of the population affected at time t = snapshot of disease.
Units: 0-1 or 0-100%
\[\frac{cases\:at\:time\:t\:(new + existing)}{total\:population\:at\:time\:t}\] Risk of being a case.
\[prevalence = incidence * duration\] It’s a bad measure of risk because it depends on the duration of disease. Chronic diseases will have high prevalnce, and very fatal diseases will have low prevalence, regardless of the incidence.
\[\frac{cases\:at\:time\:t\:(new + existing)}{total\:population\:at\:time\:t}\]
\[\frac{cases\:observed\:over\:period\:t\:(new + existing)}{total\:population\:at\:midpoint\:of\:period\:t}\]
Proportion of the population at risk of being affected that does become affected during a time period t. cases/population * time at risk
\[\frac{new\:cases\:observed\:over\:period\:t}{total\:population\:at\:risk\:during\:period\:t}\]
Risk of becoming a case.
\[\frac{cases}{population*time\:at\:risk}\] It measures risk because it measures events or transitions from affected to not affected state.
The population at risk is a crude measure of the population at risk at the beginning of the time period. It assumes a static population at risk.
Units: 0-1 or 0-100% per time interval.
\[\frac{new\:cases\:observed\:over\:period\:t}{total\:population\:at\:risk\:at\:start\:of\:time\:period\:t}\] Measures average risk. Is apt for short time-periods or static populations.
The population at risk is the sum of all the disease-free/ at risk time periods for each individual. It assumes the risk of each person in the population does not change over time.
Units: 0- \(\infty\) cases/population-time
\[\frac{new\:cases\:observed\:over\:period\:t}{total\:population-time}\] Measures risk by taking into account the time elapsed before disease occured for each individual, thus it also measures the speed at which disease occurs at a certain timepoint. Is apt for prolonged time-periods or dynamic populations.
To calculate population-time:
Proportion of the exposed individuals that becomes affected during a time period t.
\[\frac{cases}{exposed}\]
Relationship between incidence and prevalence. Gordis
Speed of death in time t.
Measures risk: good measure when disease is mild, bad measure when disease is very deadly and the case-fatality is high.
\[\frac{deaths\:over\:period\:t}{total\:population\:at\:risk\:during\:time\:period\:t}\]
overall deaths
\[\frac{deaths\:over\:period\:t}{total\:population\:at\:risk\:during\:time\:period\:t}\]
deaths in a specific subgroup (age, sex, diseased with a certain disease)
\[\frac{deaths\:in\:subgroup\:over\:period\:t}{population\:at\:risk\:in\:subgroup\:during\:time\:period\:t}\]
Proportion of the individuals that become affected by disease X who die during a time period t.
Measures disease severity
\[\frac{deaths}{cases}\]
Fraction of all the deaths caused by disease X
\[\frac{deaths\:from\:disease\:X}{all\:deaths}\]
overall population
adjusted rates controlling for confounding factors to remove the effect of that factor
Apply the specific subgroup rates of each population to a standard population and calculate the rate on the standard population.
direct adjusted example from Gordis
Compare populations: subgroup vs general
SMR = Strandard Mortality Ratio \[SMR= \frac{Observed}{Expected}\]
Expected: Apply the general population rates to each specific subgroup and add all the cases Observed: add all the observed cases in each specific subgroup
indirect adjusted example from Gordis
“The questions that motivate most studies in the health, social and behavioral sciences are not associational but causal in nature. For example, what is the efficacy of a given drug in a given population? What was the cause of death of a given individual, in a specific incident? These are causal questions because they require some knowledge of the data-generating process; they cannot be computed from the data alone, nor from the distributions that govern the data” (@Pearl2009).
"The aim of standard statistical analysis is to assess parameters of a distribution from samples drawn of that distribution. With the help of such parameters, associations among variables can be inferred, which permits the researcher to estimate probabilities of past and future events and update those probabilities in light of new information. These tasks are managed well by standard statistical analysis so long as experimental conditions remain the same. Causal analysis goes one step further; its aim is to infer probabilities under conditions that are changing, for example, changes induced by treatments or external interventions.
This distinction implies that causal and associational concepts do not mix; there is nothing in a distribution function to tell us how that distribution would differ if external conditions were to change—say from observational to experimental setup—because the laws of probability theory do not dictate how one property of a distribution ought to change when another property is modified. This information must be provided by causal assumptions which identify relationships that remain invariant when external conditions change.
Causal relations cannot be expressed in the language of probability and, hence, that any mathematical approach to causal analysis must acquire new notation – probability calculus is insufficient. To illustrate, the syntax of probability calculus does not permit us to express the simple fact that “symptoms do not cause diseases,” let alone draw mathematical conclusions from such facts. All we can say is that two events are dependent—meaning that if we find one, we can expect to encounter the other, but we cannot distinguish statistical dependence, quantified by the conditional probability P(disease|symptom) from causal dependence, for which we have no expression in standard probability calculus." (Pearl 2010)
Summary: The difference between association and causality is that causality is directional, which cannot be represented with standard calculus notation.
A statistical association between an exposure and an outcome can be due to either or both a:
disease does not develop without this factor
disease always develops with this factor
necessary AND sufficient: without that factor, the disease never develops, and in the presence of that factor, the disease always develops. This never occurs in nature, even pathogens require other factors.
necessary AND NOT sufficient: each factor is needed but alone is not able to cause disease. Ex: pathogen + immune susceptibility
NOT necessary BUT sufficient: each factor alone is able to cause disease, but so can other factors. Ex: leukemia can be caused by radiation or benzene exposure
NOT necessary NOR sufficient: presence of the factor by itself does not cuase disease
Example: a direct effect would arise because younger people change faster than older people and are therefore more likely to grow incompatible with a partner.
Example: age of marriage has an indirect effect by influencing the marriage rate, which then influences divorce. If people get married earlier, then the marriage rate may rise, because there are more young people. Consider for example if an evil dictator forced everyone to marry at age 65. Since a smaller fraction of the population lives to 65 than to 25, forcing delayed marriage will also reduce the marriage rate. If marriage rate itself has any direct effect on divorce, maybe by making marriage more or less normative, then some of that direct effect could be the indirect effect of age at marriage.
The organism is always found with the disease
The organism is not found with any other disease
The organism isolated from one individual with disease produces the disease in other individuals
Most important
temporal relationship: exposure to the factor occurs before disease
biologic plausibility: the association makes sense in the contex of existing knowledge.
consistency: the same result is replicated in different studies and populations
alternative explanations: confounding. exploration of the effect of other factors on the association.
Others
strength: as measured by measures of effect (risk ratio or odds ratio)
dose - response: the higher the exposure, the higher the risk of disease
specificity: hard to ascertain as most outcomes are multifactorial
cessation effect: if the exposure ceases, so does the effect
Measures of effect compare an exposed population to it’s counterfactual unexposed population, that is the exact same population at the same time point had it not been exposed. That is, the effect of \(E^+\) on the probability of being \(D^+\) in the SAME population.
Measures of association compare one exposed population to another unexposed population (a different population or the same population at a different time point) assuming that both populations are comparable. That is the effect of \(E^+\) on the probability of being \(D^+\) between \(E^-\) and \(E^+\).
causal types
Doomed = always has disease, exposed or not
Susceptible = has disease when exposed
Protected= does not have disease when exposed, but has disease when unexposed
Immune= never has disease, exposed or not
\(P(D^+|E^+) = p_1 + p_2 = doomed + susceptible\)
\(P(D^-|E^+) = p_3 + p_4 = protected + immune\)
\(P(D^+|E^-) = p_1 + p_3 = doomed + protected\)
\(P(D^-|E^-) = p_2 + p_4 = susceptible + immune\)
| Disease + | Disease - | Total | |
|---|---|---|---|
| Exposed + | a = E(+) & D(+) | b = E(+) & D(-) | a+b = E(+) |
| Exposed - | c = E(-) & D(+) | d = E(-) & D(-) | c+d = E(-) |
| Total | a+c = D(+) | b+d = D(-) | a+b+c+d = total |
measures of disease effect or association
Conditional probability refresher = \(P(A|B)=\frac{P(A \cup B)}{P(B)}\)
Measures magnitude of risk. Does not take into account the unexposed population or whether risk is associated to exposure. \[P(D^+|E^+)=\frac{a}{a+b}=\frac{new\:cases\:observed\:over\:period\:t}{total\:population\:at\:risk\:during\:period\:t}\]
Measures the strength of the association and possible causal relationship
RR = 1 \(\to\) no effect
\[\frac{P(D^+|E^+)}{P(D^+|E^-)}=\frac{\frac{a}{a+b}}{\frac{c}{c+d}}=\frac{incidence\:in\:exposed\:population}{incidence\:in\:unexposed\:population}\] It can be expressed as:
\[\frac{P(D^+|E^+)}{P(D^+|E^-)}=\frac{\frac{a}{a+b}}{\frac{c}{c+d}}=\frac{D^+\:in\:exposed\:population}{D^-\:in\:unexposed\:population}\]
\[\frac{D^+\:in\:person-time\:of\:exposed\:population}{D^+\:in\:person-time\:of\:unexposed\:population}=\frac{incidence\:in\:exposed\:population}{incidence\:in\:unexposed\:population}\]
Measures the strength of the association but cannot suggest a causal relationship
\[\frac{\frac{P(D^+|E^+)}{P(D^-|E^+)}}{\frac{P(D^+|E^-)}{P(D^-|E^-)}}=\frac{\frac{a}{b}}{\frac{c}{d}}=\frac{ad}{bc}=\frac{odds\:D^+\:in\:exposed\:population}{odds\:D^+\:in\:unexposed\:population}\] or
\[\frac{\frac{P(E^+|D^+)}{P(E^-|D^+)}}{\frac{P(E^+|D^-)}{P(E^-|D^-)}}=\frac{\frac{a}{c}}{\frac{b}{d}}=\frac{ad}{bc}=\frac{odds\:E^+\:in\:diseased\:population}{odds\:E^+\:in\:not\:diseased\:population}\]
A: cohort study, B: case-control study, Gordis Figure 11-5
matched-pairs OR:
Gordis Figure 11-9
ODDS RATIO CAN BE A GOOD ESTIMATE OF RELATIVE RISK When:
\[\frac{P(D^+|E^+)}{P(D^+|E^-)}\approx\frac{\frac{P(D^+|E^+)}{P(D^-|E^+)}}{\frac{P(D^+|E^-)}{P(D^-|E^-)}}\] \[\frac{\frac{a}{a+b}}{\frac{c}{c+d}} \approx \frac{\frac{a}{b}}{\frac{c}{d}}\]
When the disease does not occur frequently \(a+b \approx b\) and \(c+d \approx d\)
Gordis Figure 11-6, 11-7
The cases are representative, with regards to the history of exposure, of all the people with disease in the population from which the cases were drawn:
The controls are representative, with regards to the history of exposure, of all the people without disease in the population from which the cases were drawn:
The controls can be selected through different methods:
Not matched on time
case-based sampling: sampling occurs at the beginning of the study (\(t_0\))
cumulative incidence sampling: sampling occurs at the end of the study (\(t_1\))
assumption of constant incidence density rate over the period of time: \(\frac{ID_{exposed(t)}}{ID_{unexposed(t)}}=\bar{IDR}_{t_0 \to t_1}\)
assumption of a stable population with respect to exposure = time is NOT a confounder
Matched on time
assumption of constant incidence density rate over the period of time: \(\frac{ID_{exposed(t)}}{ID_{unexposed(t)}}=\bar{IDR}_{t_0 \to t_1} \to ID_{exposed(t)} = ID_{unexposed(t)} * \bar{IDR}_{t_0 \to t_1}\)
Both incidence and exposure change in function of time, therefore time is a confounder. By selecting the controls matching on time, we can interpret the odds ratio as a rate or risk measure, without making the rare disease assumption, by assuming only that the incidence is constant over time.
INTERPRETATION OF ODDS RATIO
\[\frac{\frac{P(D^+|E^+)}{P(D^-|E^+)}}{\frac{P(D^+|E^-)}{P(D^-|E^-)}}= \frac{\frac{a/a+b}{b/a+b}}{\frac{c/c+d}{d/c+d}} \ne \frac{\frac{a}{b}}{a+b}\]
Ratio of average risk \(\frac{P(D^+|E^+)}{P(D^-|E^+)}= \frac{a/a+b}{b/a+b}\) to average survival probability \(\frac{P(D^+|E^-)}{P(D^-|E^-)}= \frac{c/c+d}{d/c+d}\)
which is not the same as the average disease odds \(\frac{a/b}{a+b}\)
Incidence of a disease in the exposed population that is attributable to the exposure.
If > 1 the risk in the presence of exposure is greater that not in the presence of exposure
How much of the disease would be prevented if the exposure were eliminated?
\[{P(D^+|E^+)}-{P(D^+|E^-)}={\frac{a}{a+b}}-{\frac{c}{c+d}}={incidence\:in\:exposed\:population} -{incidence\:in\:unexposed\:population}\]
as a proportion:
\[\frac{{P(D^+|E^+)}-{P(D^+|E^-)}}{P(D^+|E^+)}=\frac{{\frac{a}{a+b}}-{\frac{c}{c+d}}}{\frac{a}{a+b}}=\frac{{incidence\:in\:exposed\:population} -{incidence\:in\:unexposed\:population}}{incidence\:in\:exposed\:population}\] \[\frac{{P(D^+|E^+)}/{P(D^+|E^-)}-{P(D^+|E^-)}/{P(D^+|E^-)}}{P(D^+|E^+){P(D^+|E^-)}}= \frac{RR - 1}{RR}= 1- 1/RR\]
ATTRIBUTABLE FRACTIONS
Etiologic fraction: proportion of cases in exposed population where exposure has a biological role in the disease
Excess fraction: proportion of cases in exposed population where exposure has a role of incrementing the disease incidence vs the unexposed population
Incidence of a disease in the total population that is attributable to the exposure.
How much of the disease in the total population would be prevented if the exposure were eliminated?
Incidence in the total population: \(P(D^+|E^+) * P(E^+) + P(D^+|E^-) * P(E^-)= \frac{a}{a+b} * \frac{a+b}{a+b+c+d} + \frac{a}{c+d} * \frac{c+d}{a+b+c+d}= incidence\:in\:exposed\:population * proportion\:exposed\:population + incidence\:in\:unexposed\:population * proportion\:unexposed\:population\)
\[\frac{[P(D^+|E^+) * P(E^+) + P(D^+|E^-) * P(E^-)]-{P(D^+|E^-)}}{P(D^+|E^+) * P(E^+) + P(D^+|E^-) * P(E^-)}=\frac{[{\frac{a}{a+b} * \frac{a+b}{a+b+c+d} + \frac{a}{c+d} * \frac{c+d}{a+b+c+d}}]-{\frac{c}{c+d}}}{\frac{a}{a+b} * \frac{a+b}{a+b+c+d} + \frac{a}{c+d} * \frac{c+d}{a+b+c+d}}=\frac{{incidence\:in\:total\:population} -{incidence\:in\:unexposed\:population}}{incidence\:in\:total\:population}\]
Gordis formula 12-4
Chance or random variation that remains unexplained.
The association lacks precision. The results are less reproducible.
The association lacks validity. The results are biased.
insert irva/Causal Inference: What If by Miguel A. Hernán, James M. Robins here
the amount of random error. High precision indicates the results are always similar in different experiments.
the amount of systematic error. High validity indicates proximity to the true value
external validity: generalizability of the results to the general population
internal validity: comparability among the groups in the study
Gordis figure 15-9
unexplained variation : non-deterministic counterfactuals
sampling error : the degree to which a sample population deviates from the total population. It’s unpredictable and due to the sampling process.
A sample is
a subset of the subjects in the population that could have been included in the study = a subset of the experiences the study subjects could have had
approaches to deal with random error
ASSUMPTIONS OF SAMPLING
Randomness assumption: the sample is a random selection of the subjects in the population that could have been included in the study
Representativeness assumption: the sample is representative of the subjects in the population that could have been included in the study
SAMPLING DISTRIBUTION
Different samples result in different measures of occurrence.
Sample size will determine:
the magnitude of the effect = the proximity of the measure of occurrence to the true value
the precision of the estimation method = statistical precision: the inverse of the variance.
interpretation of the hypothesis test:
\(H_0\): hypothesis that there is no association between 2 variables in the superpopulation that was sampled.
Reject the \(H_0\): Under the sampling distribution of the \(H_0\), the observed point estimate of the sample is inconsistent with the \(H_0\) for a given critical threshold (\(\alpha\) = 0.05).
Fail to reject the \(H_0\): Under the sampling distribution of the \(H_0\), the observed point estimate of the sample is consistent with the \(H_0\) for a given critical threshold (\(\alpha\) = 0.05). We cannot reject the \(H_0\) that the superpopulation groups are the same.
MISinterpretation of the hypothesis test:
There could be sources of uncontrolled bias, or chance alone could haveled to this point estimate.
interpretation of the significance test:
p-value: probability of observing a more extreme point estimate than that observed in the sample if the \(H_0\) were true.
Reject the \(H_0\): probability of observing a more extreme point estimate than that observed in the sample if the \(H_0\) were true is less than the probability of the critical region (\(\alpha\) = 0.05) = the smallest critical region (\(\alpha\)) that would lead us to reject the \(H_0\) if it were true.
MISinterpretation of the significance test:
p-value: probability of the observed data under the test hypothesis Incorrect because the p-value includes all other configurations of the data that result in a more extreme test statistic than that observed in the sample.
p-value: probability of the observed data would show as strong an association or stronger under the test hypothesis
Hogg,Tannis,Zimmerman
STATISTICAL ESTIMATION
Epidemiological analysis is a measurement, not a decision-making, problem. We want to estimate
the magnitude of the effect = point estimate. The proximity of the measure of occurrence to the true value depends on sample size, bias, random error, etc…
the precision of the estimation method = statistical precision: the inverse of the variance. Uncertainty of the point estimate. It depends on the random variability, the sample size, etc…
Interval estimation: provides information on:
significance testing reduce the information to a yes/no choice. It provides the degree of consistency between the data and a single hypothesis.
Non-comparability of the exposed and unexposed groups induced by a restriction in the analysis on certain level/s of a common effect of E or O or variables correlated with E or O.
This can be caused by:
unbalanced sampling fractions between the exposed and non-exposed population (see example below).
over-matching: when the cases and controls are matched on a factor that is not associated with the outcome but is associated with exposure.
The difference with confounding: confounding is due to unmeasured common causes and selection bias is due to errors in the selection of the two study groups that affects the internal validity.
non-response bias: survey non-respondents may have different characteristics than those who do respond
exclusion bias: different eligibility criteria between cases and controls or the controls are selected from a different population than the cases Foor example, if the disease is very deadly and the information about the cases comes from proxies (relatives, friends) that are from a different population than the controls.
berkson’s bias: “If the only way to cross the threshold is to score high, it is more common to score high on one item than on both. This general phenomenon is sometimes called Berkson’s paradox . But it is easier to remember if we call it the selection-distortion effect. Why do so many restaurants in good locations have bad food? The only way a restaurant with less-than-good food can survive is if it is in a nice location. Similarly, restaurants with excellent food can survive even in bad locations. Selection-distortion ruins your city.”(@mcelreath). In case-control studies that select cases from hospitals, often these cases are more likely to have concomitant diseases (hypertension, obesity) than the general population, which can lead to spurious associations.
incidence/prevalence bias: Studies that select cases from hospitals are selecting for severe disease and having survived longer. To avoid this bias it’s better to select incidence cases.
Selection bias does not depend only on exposure, it depends on the sampling fraction of all the cells, that means it can also depend on disease.
Bias caused by measurement errors. Individuals are categorized incorrectly.
For continuous variables = INFORMATION ERROR
For categorical variables = MISCLASSIFICATION
DIFFERENTIAL: the classification error depends on the actual values of other variables, that is, it varies between the study groups. The bias can be either towards or away from the \(H_0\).
recall bias: memory of events may bary between cases and controls. Example: Outcome: congenital malformation; Exposure: any; mothers of cases may have a better memory of the exposure than mothers of controls.
detection/surveillance bias: the diagnosis of the outcome may be earlier or better in monitored study groups than in the general population. Example: Outcome: emphysema; Exposure: smoking; smokers have more respoiratory issues and seek out medical care more often, therefore emphysema is detected earlier and more often in the exposed than in the unexposed.
observer bias: there is a systematic difference in how data are collected between study participants belonging to different groups. Occurs when the interviewer is not blinded and when there is interviewer drift, that is, when the data collection procedure for the same interviewer changes over time.
NON-DIFFERENTIAL: the classification error dos NOT depend on the actual values of other variables, and is due to a lack of accuracy in the data collection. This bias is usually towards the \(H_0\), diluting the magnitude of effect (RR and OR), but it can be away from the \(H_0\) if the exposure or outcome variables are non-binary, or if it’s a dependent missclasification (depends on error in the classification of other variables).
NON-DIFFERENTIAL EXPOSURE MISSCLASIFICATION
Modern Epidemiology
NON-DIFFERENTIAL DISEASE MISSCLASIFICATION
Modern Epidemiology
DEPENDENT: the classification error depends on errors made measuring or classifying other variables.
INDEPENDENT: the classification error does NOT depend on errors made measuring or classifying other variables.
The exposed and the unexposed in the study are not comparable, or exchangeable, which is the ultimate source of the bias ( the unexposed group is not the counterfactual of the exposed group)
A CONFOUNDER is a variable that is:
associated with the outcome, conditional on the exposure (i.e. in the exposed group) Example: smoking is a risk factor for cancer
associated with the exposure, conditional on the exposure (i.e. in the exposed group) Example: smoking is associated with drinking coffee
not on the causal pathway between the exposure and the outcome. Example: smoking is not a result of drinking coffee
There is confounding when the association between exposure and outcome includes a noncausal component attributable to their having an uncontrolled common cause.
There is selection bias when the association between exposure and outcome includes a noncausal component attributable to restricting the analysis to certain level(s) of a common effect of exposure and outcome or, more generally, to conditioning on a common effect of variables correlated with exposure and outcome.
The following methods are extracted from @Hernan2002
Statistical criteria alone are insufficient to characterize either confounding or selection bias. The only method to reliably identify a counfounder is to combine statistical associations from the data with background knowledge about the causal network that links exposure, outcome, and potential confounders.
automatic variable selection procedures: i.e stepwise regression. It’s based on including variables with “significant” p-values. It assumes that all important confounders will be selected.
change in effect estimate: comparison of the effect estimates between adjusted and unadjusted effect estimates. It assumes that any variable substantially associated with an estimate change is worth adjusting for.
We evaluate the crude effect measure and the effect measure stratified by the possible confounder. If the stratum-specific effect measures are similar to each other but different to the crude effect measure and the relative change greater than 10%,the variable is selected as a confounder.
Example: Is the association causal or confounded by age?
DAG = Directed Acyclic Graph
DAGs are diagrams that link variables by arrows that represent direct causal effects (protective or causative) of one variable on another.
There are only four types of variable relations that combine to form all possible paths (from @mcelreath2020):
the confounder = fork: X ← Z → Y. This is the classic confounder: some variable Z is a common cause of X and Y, generating a correlation between them. If we condition on Z, then learning X tells us nothing about Y. X and Y are independent, conditional on Z.
the pipe = intermediary: X → Z → Y. The treatment X influences Z which influences Y. If we condition on Z, we block the path from X to Y. X and Y are independent, conditional on Z.
the collider = common effect: X → Z ← Y. Conditioning on Z, the collider variable, opens the path. X and Y are dependent, conditional on Z, however neither X nor Y has any causal influence on the other.
the descendent = association?: Z \(\to\) D. Descendent is a variable influenced by another variable. Conditioning on a descendent partly conditions on its parent. Conditioning on D will also condition, to a lesser extent, on Z because D has some information about Z.
Backdoor criterion
Path: any series of variables you could walk through to get from one variable to another, ignoring the directions of the arrows.
Blocking all confounding paths between some predictor X and some outcome Y is known as shutting the backdoor, thus eliminating spurious associations that are non-causal.
Example:
There are two paths connecting E and O:
E → O
E ← C → O.
Both of these paths create a statistical association between E and O. But only the first path is causal. The second path is non-causal. If only the second path existed, and we changed E, it would not change O. Any causal influence of E on O operates only on the first path.
REMINDER:
causal effect: The total causal effect is the sum of the direct and indirect effects: Example: age of marriage influences divorce in two ways.
direct effect. A \(\to\) D Example: a direct effect would arise because younger people change faster than older people and are therefore more likely to grow incompatible with a partner.
indirect effect. A \(\to\) M \(\to\) D Example: age of marriage has an indirect effect by influencing the marriage rate, which then influences divorce. If people get married earlier, then the marriage rate may rise, because there are more young people. Consider for example if an evil dictator forced everyone to marry at age 65. Since a smaller fraction of the population lives to 65 than to 25, forcing delayed marriage will also reduce the marriage rate. If marriage rate itself has any direct effect on divorce, maybe by making marriage more or less normative, then some of that direct effect could be the indirect effect of age at marriage.
Do-operator
do(E) closes the backdoor paths into E, as in a manipulative experiment.
P(O|do(E)) defines a causal relationship because it tells us the expected result of manipulating E on O
Confounding: P(O|E) \(\ne\) P(O|do(E)). The relationship between the E and O when the backdoor paths are closed is not the same, indicating that there is confounding.
Conditional probability, non-causal: P(O|E) \(\ne\) P(O|not-E) doesn’t close the backdoor, and therefore does not give a causal relationship.
Total causal relationship: if P(O|do(E)) \(\ne\) P(O|not-E), then E is the cause of O.
Direct causal relationship: might require closing more backdoor paths.
Examples from @Hernan2002
Example using all 3 methods of identifying confounders:
Steps to identify which variables are confounders and must be controlled to obtain and unbiased causal effect estimate:
List all of the paths connecting E (the potential cause of interest) and O (the outcome).
Classify each path by whether it is open or closed. A path is open unless it contains a collider.
Classify each path by whether it is a backdoor path. A backdoor path has an arrow entering E.
If there are any open backdoor paths, decide which variable(s) to condition on to close it (if possible). To obtain a sufficient set, use the smallest set of confounders (preferable closer to the outcome)
Obtaining an unbiased estimate of the total causal effect requires measuring and adjusting for all confounders of the E \(\to\) O association
Obtaining an unbiased estimate of the direct causal effect requires measuring and adjusting for all confounders of both the
E \(\to\) O association
J \(\to\) O association
To obtain the total causal effect we don’t condition on C or J:
To obtain the total causal effect we condition on C but not J:
TESTABLE IMPLICATIONS
Testable implications can be read off the diagrams using a graphical criterion known as d- separation (Pearl, 1988). Each diagram encodes causal assumptions, each corresponding to a missing arrow or a missing double-arrow between a pair of variables.
DAGs imply that some variables are independent of others under certain conditions, therefore the testable implications of a DAG are it’s CONDITIONAL INDEPENDENCIES.
CONDITIONAL INDEPENDENCIES describe which variables should be associated with one another (or not) in the data, and which variables become disassociated when we condition on some other set of variables.
Condition independencies are pairs of variables that are not associated, once we condition on some set of other variables.
Conditioning: conditioning on a variable Z means learning its value and then asking if X adds any additional information about Y. If learning X doesn’t give you any more information about Y, then we might say that Y is independent of X conditional on Z. This conditioning statement is sometimes written as: \(Y \!\perp\!\!\!\perp X|Z\)
\(X \not\!\perp\!\!\!\perp Y\) means “not independent of”
\(X \!\perp\!\!\!\perp Y\) means “independent of”
Marginal and conditional assotiation
Modern Epidemiology
DURING STUDY DESIGN
Manipulation of the confounding factor in the study design removes the influence of C on E: when we determine E, the C variable does not influence E, thus blocking the non-causal path between E and O (E ← C → O). Once the path is blocked, there is only one way for information to go between E and O, and then measuring the association between E and O would yield a useful measure of causal influence.
Limitations: feasibility, ethics
Limitations: reduction of the number of available individuals, the factor that is used for restriction can’t be analyzed, possible residual confounding if the restriction is insufficient.
Limitations: expensive, complicated logistics, the factor that is used for matching can’t be analyzed, as the variables must be collected before participant enrollment
DURING DATA ANALYSIS
Conditioning on the confounding factor in the study analysis: adding C to the model blocks the non-causal path E ← C → O.
Why? Think of this path in isolation, as a complete model.
Once you learn C, also learning E will give you no additional information about O.
Example: Suppose for example that C is the average wealth in a region. Regions with high wealth have better schools, resulting in more education (exposure E), as well as better paying jobs, resulting in higher wages (outcome O). If you don’t know the region a person lives in, learning the person’s education E will provide information about their wages O, because E and O are correlated across regions. But after you learn which region a person lives in, assuming there is no other path between E and O, then learning E tells you nothing more about O. This is the sense in which conditioning on C blocks the path — it makes E and O independent, conditional on C.
Limitations: unless background knowledge is included in the variable selection, confounding can persist.
Limitations: can only be performed for few variables at a time
Limitations: can only be performed for few variables at a time
Modern Epidemiology
We look into causal inference using a working example from Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Second edition by Richard McElreath: Correlation between marriage rate (the exposure) and divorce rate (the outcome).
# load data and copy
library(rethinking)
data(WaffleDivorce)
d <- WaffleDivorce
# standardize variables
d$D <- standardize( d$Divorce )
d$M <- standardize( d$Marriage )
d$A <- standardize( d$MedianAgeMarriage )
EXAMPLE 1
There are three observed variables in play: divorce rate (D), marriage rate (M), and the median age at marriage (A) in each State of the U.S. Both marriage rates and median age at marriage are great predictors of the divorce rate in a given State, but are these relationships causal?
The rate at which adults marry (M) is a great predictor of divorce rate. But does marriage cause divorce? In a trivial sense it obviously does: One cannot get a divorce without first getting married. But there’s no reason high marriage rate must cause more divorce. It’s easy to imagine high marriage rate indicating high cultural valuation of marriage and therefore being associated with low divorce rate.
Age at marriage (A) is also a good predictor of divorce rate— higher age at marriage predicts less divorce. But there is no reason this has to be causal, either, unless age at marriage is very late and the spouses do not live long enough to get a divorce.
Age of marriage influences divorce in two ways:
To infer the strength of these different arrows, we need more than one statistical model.
To obtain the total effect of of age at marriage on divorce rate we condition on age at marriage.
The total causal effect is the sum of the direct and indirect effects
We assume that these variables are associated. If we look in the data and find that any pair of variables are not associated, then something is wrong with the DAG (assuming the data are correct). In these data, all three pairs are in fact strongly associated. We can use cor to measure simple correlations. Correlations are sometimes terrible measures of association—many different patterns of association with different implications can produce the same correlation. But they do honest work in this case.
cor(d$D, d$M)
## [1] 0.3737314
cor(d$D, d$A)
## [1] -0.5972392
cor(d$M, d$A)
## [1] -0.721096
\(D_{i} ∼ Normal(\mu_{i}, \sigma)\)
\(\mu_{i} = \alpha + \beta_{A}A_{i}\)
m5.1 <- quap(
alist(
D ~ dnorm( mu , sigma ) ,
mu <- a + bA * A ,
a ~ dnorm( 0 , 0.2 ) ,
#when βA = 1, a change of 1.2 years in median age at marriage is associated with a full standard deviation change in the outcome variable (divorce)
#only 5% of plausible slopes more extreme than 1.
bA ~ dnorm( 0 , 0.5 ) ,
sigma ~ dexp( 1 )
) , data = d )
precis(m5.1)
## mean sd 5.5% 94.5%
## a -2.140546e-05 0.09738192 -0.1556565 0.1556137
## bA -5.684290e-01 0.11000402 -0.7442366 -0.3926213
## sigma 7.883592e-01 0.07801955 0.6636689 0.9130495
The outcome and the predictor are both standardized, the intercept α should end up very close to zero.
What does the prior slope \(\beta_{A}\) imply? If \(\beta_{A}\) = 1, that would imply that a change of one standard deviation in age at marriage is associated likewise with a change of one standard deviation in divorce. To know whether or not that is a strong relationship, you need to know how big a standard deviation of age at marriage is:
sd( d$MedianAgeMarriage )
## [1] 1.24363
So when \(\beta_{A}\) = 1, a change of 1.2 years in median age at marriage is associated with a full standard deviation change in the outcome variable. That seems like an insanely strong relationship.
posterior for \(\beta_{A}\) is reliably negative, as seen:
# compute percentile interval of mean
A_seq <- seq( from=-3 , to=3.2 , length.out=30 )
mu <- link( m5.1 , data=list(A=A_seq) )
mu.mean <- apply( mu , 2, mean )
mu.PI <- apply( mu , 2 , PI )
# plot it all
plot( D ~ A , data=d , col=rangi2 )
lines( A_seq , mu.mean , lwd=2 )
shade( mu.PI , A_seq )
Model m5.1, the regression of D on A, tells us only that the total influence of age at marriage is strongly negative with divorce rate. The “total” here means we have to account for every path from A to D. There are two such paths in this graph: A → D, a direct path,and A → M → D, an indirect path.
In general, it is possible that a variable like A has no direct effect at all on an outcome like D. It could still be associated with D entirely through the indirect path. That type of relationship is known as mediation.
As you’ll see however, the indirect path does almost no work in this case. How can we show that?
\(D_{i} ∼ Normal(\mu_{i}, \sigma)\)
\(\mu_{i} = \alpha + \beta_{M}M_{i}\)
m5.2 <- quap(
alist(
D ~ dnorm( mu , sigma ) ,
mu <- a + bM * M ,
a ~ dnorm( 0 , 0.2 ) ,
bM ~ dnorm( 0 , 0.5 ) ,
sigma ~ dexp( 1 )
) , data = d )
precis(m5.2)
## mean sd 5.5% 94.5%
## a 3.731793e-05 0.10824625 -0.1729611 0.1730357
## bM 3.501208e-01 0.12592675 0.1488656 0.5513761
## sigma 9.102633e-01 0.08986183 0.7666467 1.0538799
# compute percentile interval of mean
M_seq <- seq( from=-3 , to=3.2 , length.out=30 )
mu <- link( m5.2 , data=list(M=M_seq) )
mu.mean <- apply( mu , 2, mean )
mu.PI <- apply( mu , 2 , PI )
# plot it all
plot( D ~ M , data=d , col=rangi2 )
lines( M_seq , mu.mean , lwd=2 )
shade( mu.PI , M_seq )
This relationship isn’t as strong as the one between A and D.
The pattern we see in the previous two models is symptomatic of a situation in which only one of the predictor variables, A in this case, has a causal impact on the outcome, D, even though both predictor variables are strongly associated with the outcome.
We know from the model conditioning on M (m5.2) that marriage rate is positively associated with divorce rate. But that isn’t enough to tell us that the path M → D is positive. It could be that the association between M and D arises entirely from A’s influence on both M and D. Like this:
This DAG is also consistent with the posterior distributions of models m5.1 and m5.2. Why? Because both M and D “listen” to A. They have information from A. So when you inspect the association between D and M, you pick up that common information that they both got from listening to A.
So which is it? Is there a direct effect of marriage rate, or rather is age at marriage just driving both, creating a spurious correlation between marriage rate and divorce rate? To find out, we need to consider carefully what each DAG implies.
This DAG says:
There are 3 causal assumptions that can be tested (one for every arrow).
Before we condition on anything, we assume everything is associated with everything else.
The testable implications are:
implied conditional independencies = none
DMA_dag1 <- dagitty('dag{ D <- A -> M -> D }')
impliedConditionalIndependencies( DMA_dag1 )
In this DAG, it is still true that all three variable are associated with one another. A is associated with D and M because it influences them both. And D and M are associated with one another, because A influences them both. They share a cause, and this leads them to be correlated with one another through that cause.
There are 2 causal assumptions that can be tested (one for every arrow).
(2) M causes D
But suppose we condition on A. All of the information in M that is relevant to predicting D is in A. So once we’ve conditioned on A, M tells us nothing more about D. So in the second DAG, a testable implication is that D is independent of M, conditional on A. In other words, \(D \!\perp\!\!\!\perp M|A\)
The testable implications are:
All 3 variables should be associated, before conditioning on anything:
\(D \not\!\perp\!\!\!\perp A\) A not independent of D
\(D \not\!\perp\!\!\!\perp M\) M not independent of D
\(A \not\!\perp\!\!\!\perp M\) D not independent of A
\(D \!\perp\!\!\!\perp M|A\) D and M should be independent after conditioning on A.
implied conditional independencies = D || M | A
DMA_dag2 <- dagitty('dag{ D <- A -> M }')
impliedConditionalIndependencies( DMA_dag2 )
## D _||_ M | A
Test the difference between the two DAGs
The only implication that differs between these DAGs is the last one:\(D \!\perp\!\!\!\perp M|A\) D and M should be independent after conditioning on A.
To test this implication, we need a statistical model that conditions on A, so we can see whether that renders D independent of M. And that is what multiple regression helps with. It can address a useful descriptive question: Is there any additional value in knowing a variable, once I already know all of the other predictor variables?
So for example once you fit a multiple regression to predict divorce using both marriage rate and age at marriage, the model addresses the questions: (1) After I already know marriage rate, what additional value is there in also knowing age at marriage? (2) After I already know age at marriage, what additional value is there in also knowing marriage rate?
The parameter estimates corresponding to each predictor are the (often opaque) answers to these questions. The questions above are descriptive, and the answers are also descriptive. It is only the derivation of the testable implications above that give these descriptive results a causal meaning. But that meaning is still dependent upon believing the DAG.
For each predictor, the parameter measures its conditional association with the outcome.
\(D_{i} ∼ Normal(\mu_{i}, \sigma)\)
\(\mu_{i} = \alpha + \beta_{M}M_{i} + \beta_{A}A_{i}\)
m5.3 <- quap(
alist(
D ~ dnorm( mu , sigma ) ,
mu <- a + bM*M + bA*A ,
a ~ dnorm( 0 , 0.2 ) ,
bM ~ dnorm( 0 , 0.5 ) ,
bA ~ dnorm( 0 , 0.5 ) ,
sigma ~ dexp( 1 )
) , data = d )
precis( m5.3 )
## mean sd 5.5% 94.5%
## a 9.669939e-05 0.09708007 -0.1550560 0.1552494
## bM -6.552244e-02 0.15077879 -0.3064961 0.1754512
## bA -6.136422e-01 0.15098906 -0.8549519 -0.3723325
## sigma 7.851608e-01 0.07785385 0.6607353 0.9095863
The posterior mean for marriage rate, bM, is now close to zero, with plenty of probability of both sides of zero. The posterior mean for age at marriage, bA, is essentially unchanged. It will help to visualize the posterior distributions for all three models, focusing just on the slope parameters βA and βM:
plot(coeftab(m5.1,m5.2,m5.3), par=c("bA","bM"))
bA doesn’t move, only grows a bit more uncertain, while bM is only associated with divorce when age at marriage is missing from the model. You can interpret these distributions as saying: Once we know median age at marriage for a State, there is little or no additional predictive power in also knowing the rate of marriage in that State, which means \(D \!\perp\!\!\!\perp M|A\). D and M are independent after conditioning on A, which corresponds to the second DAG.
Note that this does not mean that there is no value in knowing marriage rate. Consistent with the earlier DAG, if you didn’t have access to age-at-marriage data, then you’d definitely find value in knowing the marriage rate. M is predictive but not causal. Assuming there are no other causal variables missing from the model, this implies there is no important direct causal path from marriage rate to divorce rate. The association between marriage rate and divorce rate is spurious, caused by the influence of age of marriage on both marriage rate and divorce rate.
EXAMPLE 2
We’re interested in the total causal effect of the number of Waffle Houses on divorce rate in each State. Presumably, the naive correlation between these two variables is spurious. What is the minimal adjustment set that will block backdoor paths from Waffle House to divorce?
Let’s make a graph:
## { A, M }
## { S }
We could control for either A and M or for S alone. This DAG is obviously not satisfactory—it assumes there are no unobserved confounds, which is very unlikely for this sort of data. But we can still learn something by analyzing it. While the data cannot tell us whether a graph is correct, it can sometimes suggest how a graph is wrong.
Inspecting implied conditional independencies, we can at least test some of the features of a graph.
impliedConditionalIndependencies( dag6 )
## A _||_ W | S
## D _||_ S | A, M, W
## M _||_ W | S
The median age of marriage should be independent of (||) Waffle Houses, conditioning on (|) a State being in the south.
Divorce and being in the south should be independent when we simultaneously condition on all of median age of marriage, marriage rate, and Waffle Houses.
Marriage rate and Waffle Houses should be independent, conditioning on being in the south.
An interaction occurs when the measure of association between a risk factor and outcome depends on the level of another factor.
| confounding | interaction | |
|---|---|---|
| origin | property of distribution in source popuation | biologic, sociologic… |
| distribution of the 3d variable | differs by exposure status | can be statistically independent of exposure |
| does the 3d variable predic the outcome? | yes | not necessarily |
| stratum-specific measures | similar across strata | differ across strata |
| crude overall measure | differ from stratum-specific measures | falls between stratum-specific measures |
| summary adjusted measure | makes sense | not meaningful |
| testing | cannot test | can test |
| relationship to the other | if a strong interaction exists, it makes no sense to discuss confounding | both may occur |
Interaction, effect modification and effect heterogeneity sometimes are used to mean the same thing.
Interaction Tables
| A no | A yes | |
|---|---|---|
| B no | Rab | RAb |
| B yes | RaB | RAB |
Relative Risk in Interaction
| A no | A yes | |
|---|---|---|
| B no | 1 | RAb/Rab |
| B yes | RaB/Rab | RAB/Rab |
To obtain the relative risk measure, we divide the risk measures in the presence of A by the referent within each level of B, thus we have 2 groups, one for each strata.
Relative Risk in effect modification
| A no | A yes | |
|---|---|---|
| B no | 1 | RaB/Rab |
| B yes | 1 | RAB/RAb |
synergism: the additive measure of joint exposure is MORE THAN the sum of the measures of association between exposure and outcome for each exposure alone.
antagonism: the additive measure of joint exposure is LESS THAN the sum of the measures of association between the exposure and outcome for each exposure alone.
additive: Excess risk and linear regression are on the additive scale.
multiplicative: Relative risk, logistic, Poisson and porportional hazards models all assume factors act multiplicatively unless an interaction effect is included.
Gordis
Example:
A logistic regression is on the additive scale when it’s transformed:
\(logit(\hat{p_i}) = log(p_i/(1-p_i)) = log(ODDS) =\beta_0 +\beta_A x_A + \beta_B x_B + +\beta_{AB} x_A x_B\)
But a logistic regression is on the multiplicative scale when it’s on the original probability scale:
\(p_i/(1-p_i) = ODDS =e^{\beta_0 +\beta_A x_A + \beta_B x_B + +\beta_{AB} x_A x_B}\)
Is A and effect modifier of B? What is the effect of B versus no B?
In the presence of A: \(x_A = 1\)
\(log(ODDS) =\beta_0 +\beta_A x_A + \beta_B x_B + \beta_{AB} x_A x_B\)
\(log(ODDS) =\beta_0 +\beta_A x_A\)
The difference in effect between B yes and B no is \(\beta_B + \beta_{AB}\)
\((\beta_0 +\beta_A x_A + \beta_B x_B + \beta_{AB} x_A x_B) - (\beta_0 +\beta_A x_A ) = \beta_B + \beta_{AB}\)
Not in the presence of A: \(x_A = 0\)
\(log(ODDS) =\beta_0 + \beta_B x_B\)
\(log(ODDS) =\beta_0\)
The difference in effect between B yes and B no is \(\beta_B\)
\((\beta_0 + \beta_B x_B) - (\beta_0) = \beta_B\)
There is effect modification if \(\beta_B \ne \beta_B+ \beta_{AB}\), that is, if the presence of B has a different effect on the different levels of A.
automatic variable selection procedures: i.e stepwise regression. It’s based on including variables with “significant” p-values. An interaction exists if an interaction coefficient has a “significant p-value”.
change in effect estimate: comparison of the effect estimates between adjusted and unadjusted effect estimates.
We evaluate the crude effect measure and the effect measure stratified by the possible confounder or effect modifier. If the stratum-specific effect measures are different from each other and the relative change greater than 10%, the variable is selected as a interaction.
If \(\beta_B\) in the presence of B but not A \(\ne\) \(\beta_B+ \beta_{AB}\) in the presence of B and A, that is, if the presence of B has a different effect on the different levels of A.
Gordis
Age adjusted rate ratios for lung cancer due to arsenic exposure.
| Unexposed area | Residentially exposed | Smelter worker | |
|---|---|---|---|
| Non smokers | 1 | 2.3 | 8.4 |
| Smokers- | 8.3 | 17.5 | 26.2 |
Does smoking interact multiplicatively with arsenic?
| Unexposed area | Residentially exposed | Smelter worker | |
|---|---|---|---|
| Non smokers | 1 | 2.3 * 8.3 = 19 | 8.4 * 8.3 = 69 |
| Smokers- | 8.3 | 17.5 | 26.2 |
There is no multiplicative interaction between smoking and residential arsenic exposure on lung cancer risk: \(2.3 * 8.3 = 19 \approx 17.5\)
There is an antagonistic multiplicative interaction between smoking and smelter worker arsenic exposure on lung cancer risk: \(8.4 * 8.3 = 69 \ne 26.2\)
Is smoking an effect modifier of the arsenic relative risk?
| Unexposed area | Residentially exposed | Smelter worker | |
|---|---|---|---|
| Non smokers | 1 | 2.3/1 | 8.4/1 |
| Smokers- | 8.3/8.3 | 17.5/8.3 | 26.2/8.3 |
| Unexposed area | Residentially exposed | Smelter worker | |
|---|---|---|---|
| Non smokers | 1 | 2.3 | 8.4 |
| Smokers- | 1 | 2.1 | 3.2 |
Smoking does not modify the effect of residential arsenic exposure on lung cancer risk: \(2.3 \approx 2.1\)
Smoking does modify the effect of smelter workers’ arsenic exposure on lung cancer risk: this exposure has a much larger effect on non-smokers that smokers: \(8.4 > 3.2\)
If we were looking at excess risk instead of relative risk, we would look at the additive effect.
Is smoking an effect modifier of the arsenic excess risk?
| Unexposed area | Residentially exposed | Smelter worker | |
|---|---|---|---|
| Non smokers | 1 | 2.3-1 | 8.4-1 |
| Smokers- | 8.3 | 17.5-8.3 | 26.2-8.3 |
| Unexposed area | Residentially exposed | Smelter worker | |
|---|---|---|---|
| Non smokers | 1 | 1.3 | 7.4 |
| Smokers- | 8.3 | 9.2 | 17.9 |
Smoking does modify the effect of residential arsenic exposure on lung cancer excess risk synergistically: \(1.3 < 9.2\)
Smoking does modify the effect of smelter workers’ arsenic exposure on lung cancer excess risk synergistically: \(7.4 < 17.9\)
Does smoking interact additively with arsenic?
| Unexposed area | Residentially exposed | Smelter worker | |
|---|---|---|---|
| Non smokers | 1 | 2.3-1 | 8.4-1 |
| Smokers- | 8.3-1 | 17.5-1 | 26.2-1 |
| Unexposed area | Residentially exposed | Smelter worker | |
|---|---|---|---|
| Non smokers | 1 | 1.3 | 7.4 |
| Smokers- | 7.3 | 16.5 | 25.2 |
There is additive interaction between smoking and residential arsenic exposure on lung cancer risk: \(1.3 + 7.3 = 8.6 \ne 16.5\)
There is no additive interaction between smoking and smelter worker arsenic exposure on lung cancer risk: \(7.4 + 16.5 = 23.9 \approx 25.2\)
Africa is geographically special, in a puzzling way: Bad geography tends to be related to bad economies outside of Africa, but African economies may actually benefit from bad geography. To appreciate the puzzle, look at regressions of terrain ruggedness—a particular kind of bad geography—against economic performance (log GDP136 per capita in the year 2000), both inside and outside of Africa. The variable rugged is a Terrain Ruggedness Index137 that quantifies the topographic heterogeneity of a landscape. The outcome variable here is the logarithm of real gross domestic product per capita, from the year 2000, rgdppc_2000. We use the logarithm of it, because the logarithm of GDP is the magnitude of GDP.
library(rethinking)
data(rugged)
d <- rugged
# make the log version of criterion
d <-
d %>%
mutate(log_gdp = log(rgdppc_2000))
# extract countries with GDP data
dd <-
d %>%
filter(complete.cases(rgdppc_2000)) %>%
# re-scale variables
mutate(log_gdp_std = log_gdp / mean(log_gdp),
rugged_std = rugged / max(rugged))
If this relationship is at all causal, it may be because rugged regions of Africa were pro- tected against the Atlantic and Indian Ocean slave trades. Slavers preferred to raid easily accessed settlements, with easy routes to the sea. Those regions that suffered under the slave trade understandably continue to suffer economically, long after the decline of slave-trading markets. However, an outcome like GDP has many influences, and is furthermore a strange measure of economic activity. And ruggedness is correlated with other geographic features, like coastlines, that also influence the economy. So it is hard to be sure what’s going on here.
The causal hypothesis, in DAG form, might be (where R is terrain ruggedness, G is GDP, C is continent): R and C both influence G. This could mean that they are independent influences or rather that they interact (one moderates the influence of the other). The DAG does not display an interaction. That’s because DAGs do not specify how variables combine to influence other variables. The DAG above implies only that there is some function that uses R and C to generate G. In typical notation, G = f(R, C).
So we need a statistical approach to judge different propositions for f(R, C).
So we need a statistical approach to judge different propositions for f(R, C). How do we make a model that produces the conditionality in Figure 8.2? We could cheat by splitting the data into two data frames, one for Africa and one for all the other continents. But it’s not a good idea to split the data in this way. Here are four reasons:
First, there are usually some parameters, such as σ, that the model says do not depend in any way upon continent. By splitting the data table, you are hurting the accuracy of the es- timates for these parameters, because you are essentially making two less-accurate estimates instead of pooling all of the evidence into one estimate. In effect, you have accidentally as- sumed that variance differs between African and non-African nations. Now, there’s nothing wrong with that sort of assumption. But you want to avoid accidental assumptions.
Second, in order to acquire probability statements about the variable you used to split the data, cont_africa in this case, you need to include it in the model. Otherwise, you have a weak statistical argument. Isn’t there uncertainty about the predictive value of distinguishing between African and non-African nations? Of course there is. Unless you analyze all of the data in a single model, you can’t easily quantify that uncertainty. If you just let the posterior distribution do the work for you, you’ll have a useful measure of that uncertainty.
Third, we may want to use information criteria or another method to compare models. In order to compare a model that treats all continents the same way to a model that allows different slopes in different continents, we need models that use all of the same data (as explained in Chapter 7). This means we can’t split the data for two separate models. We have to let a single model internally split the data.
Fourth, once you begin using multilevel models (Chapter 13), you’ll see that there are advantages to borrowing information across categories like “Africa” and “not Africa.” This is especially true when sample sizes vary across categories, such that overfitting risk is higher within some categories. In other words, what we learn about ruggedness outside of Africa should have some effect on our estimate within Africa, and visa versa. Multilevel models (Chapter 13) borrow information in this way, in order to improve estimates in all categories. When we split the data, this borrowing is impossible.
We look at the different possibilities to model the association between log(GDP) and ruggedness
Fit a single model to all the data, ignoring continent.
\(log(y_i) \sim N(\mu_i, \sigma)\)
\(\mu_i = \alpha_0 + \beta (r_i - \bar{r})\)
m8.1 <- quap(
alist(
log_gdp_std ~ dnorm( mu , sigma ) ,
mu <- a + b*( rugged_std - 0.215 ) ,
a ~ dnorm( 1 , 0.1 ) ,
b ~ dnorm( 0 , 0.3 ) ,
sigma ~ dexp(1)
) , data=dd )
precis( m8.1 )
## mean sd 5.5% 94.5%
## a 1.000000079 0.010412512 0.98335887 1.01664128
## b 0.002001904 0.054796238 -0.08557307 0.08957688
## sigma 0.136504552 0.007397121 0.12468252 0.14832658
Really no overall association between terrain ruggedness and log GDP.
Next we’ll see how to split apart the continents.
The first thing to realize is that just in- cluding an indicator variable for African nations, cont_africa here, won’t reveal the re- versed slope. It’s worth fitting this model to prove it to yourself, though.
\(log(y_i) \sim N(\mu_i, \sigma)\)
\(\mu_i = \alpha_0 + \beta (r_i - \bar{r}) + \gamma A_i\)
where \(A_i\) is cont_africa, a 0/1 indicator variable.
The problem here, and in general, is that we need a prior for \(\gamma\). Okay, we can do priors. But what that prior will necessarily do is tell the model that \(\mu_i\) for a nation in Africa is more uncertain, before seeing the data, than \(\mu_i\) outside Africa. And that makes no sense.
There is a simple solution: Nations in Africa will get one intercept and those outside Africa another.
\(log(y_i) \sim N(\mu_i, \sigma)\)
\(\mu_i = \alpha_{CID[i]} + \beta (r_i - \bar{r})\)
where cid is an index variable, continent ID. It takes the value 1 for African nations and 2 for all other nations. This means there are two parameters, \(\alpha_{1}\) and \(\alpha_{2}\) one for each unique index value. The notation cid[i] just means the value of cid on row i. Using this approach, instead of the conventional approach of adding another term with the 0/1 indicator variable, doesn’t force us to say that the mean for Africa is inherently less certain than the mean for all other continents.
#make variable to index Africa (1) or not (2)
dd$cid <- ifelse( dd$cont_africa==1 , 1 , 2 )
m8.2 <- quap( alist(
log_gdp_std ~ dnorm( mu , sigma ) ,
mu <- a[cid] + b*( rugged_std - 0.215 ) , a[cid] ~ dnorm( 1 , 0.1 ) ,
b ~ dnorm( 0 , 0.3 ) ,
sigma ~ dexp( 1 )
) , data=dd )
precis( m8.2 , depth=2 )
## mean sd 5.5% 94.5%
## a[1] 0.88041284 0.015937003 0.8549424 0.90588325
## a[2] 1.04916425 0.010185554 1.0328858 1.06544274
## b -0.04651347 0.045686725 -0.1195297 0.02650274
## sigma 0.11238738 0.006091077 0.1026527 0.12212209
The parameter a[1] is the intercept for African nations. It seems reliably lower than a[2].
Now to compare these models, using WAIC:
## WAIC SE dWAIC dSE pWAIC weight
## m8.2 -252.2687 15.30518 0.00000 NA 4.258517 1.000000e+00
## m8.1 -188.7544 13.29153 63.51424 15.14616 2.690177 1.614571e-14
m8.2 gets all the model weight. And while the standard error of the difference in WAIC is 15, the difference itself is 64. So the continent variable seems to be picking up some important association in the sample.
Let’s plot the posterior predictions for m8.2, so you can see how, despite it’s predictive superiority to m8.1, it still doesn’t manage different slopes inside and outside of Africa. African nations are shown in blue, while nations outside Africa are shown in gray. What you’ve ended up with here is a rather weak negative relationship between economic development and ruggedness. The African nations do have lower overall economic development, and so the blue regression line is below, but parallel to, the black line. All including a dummy variable for African nations has done is allow the model to predict a lower mean for African nations. It can’t do anything to the slope of the line. The fact that WAIC tells you that the model with the dummy variable is hugely better only indicates that African nations on average do have lower GDP.
How can you recover the change in slope you saw at the start of this section? You need a proper interaction effect. This just means we also make the slope conditional on continent.
And again, there is a conventional approach to specifying an interaction that uses an indica- tor variable and a new interaction parameter. It would look like this:
\(log(y_i) \sim N(\mu_i, \sigma)\)
\(\mu_i = \alpha_{CID[i]} + (\beta + \gamma A_i) (r_i - \bar{r})= \alpha_{CID[i]} +\beta(r_i - \bar{r})+ \gamma A_i(r_i - \bar{r})\)
where Ai is a 0/1 indicator for African nations. This is equivalent to our index approach, but it is much harder to state sensible priors. Any prior we put on γ makes the slope inside Africa more uncertain than the slope outside Africa. And again that makes no sense. But in the indexing approach, we can easily assign the same prior to the slope, no matter which continent.
\(log(y_i) \sim N(\mu_i, \sigma)\)
\(\mu_i = \alpha_{CID[i]} + \beta_{CID[i]}(r_i - \bar{r})\)
m8.3 <- quap(
alist(
log_gdp_std ~ dnorm( mu , sigma ) ,
mu <- a[cid] + b[cid]*( rugged_std - 0.215 ) ,
a[cid] ~ dnorm( 1 , 0.1 ) ,
b[cid] ~ dnorm( 0 , 0.3 ) ,
sigma ~ dexp( 1 )
) , data=dd )
precis( m8.3 , depth=2 )
## mean sd 5.5% 94.5%
## a[1] 0.8865639 0.015675695 0.8615111 0.91161667
## a[2] 1.0505698 0.009936606 1.0346892 1.06645039
## b[1] 0.1325054 0.074204443 0.0139124 0.25109846
## b[2] -0.1425767 0.054749403 -0.2300768 -0.05507653
## sigma 0.1094941 0.005935299 0.1000084 0.11897987
The slope is essentially reversed inside Africa, 0.13 instead of −0.14.
How much does allowing the slope to vary improve expected prediction? Let’s use PSIS to compare this new model to the previous two. You could use WAIC here as well. It’ll give almost identical results.
## PSIS SE dPSIS dSE pPSIS weight
## m8.3 -258.9122 15.26994 0.00000 NA 5.248904 9.731544e-01
## m8.2 -251.7313 15.27774 7.18088 6.563466 4.462810 2.684562e-02
## m8.1 -188.7363 13.28687 70.17592 15.393216 2.681052 5.619209e-16
Model family m8.3 has more than 95% of the weight. That’s very strong support for including the interaction effect, if prediction is our goal. But the modicum of weight given to m8.2 suggests that the posterior means for the slopes in m8.3 are a little overfit. And the standard error of the difference in PSIS between the top two models is almost the same as the difference itself.
# plot Africa - cid=1
d.A1 <- dd[ dd$cid==1 , ]
plot( d.A1$rugged_std , d.A1$log_gdp_std , pch=16 , col=rangi2 ,
xlab="ruggedness (standardized)" , ylab="log GDP (as proportion of mean)" ,
xlim=c(0,1) )
mu <- link( m8.3 , data=data.frame( cid=1 , rugged_std=rugged_seq ) )
mu_mean <- apply( mu , 2 , mean )
mu_ci <- apply( mu , 2 , PI , prob=0.97 )
lines( rugged_seq , mu_mean , lwd=2 )
shade( mu_ci , rugged_seq , col=col.alpha(rangi2,0.3) )
mtext("African nations")
# plot non-Africa - cid=2
d.A0 <- dd[ dd$cid==2 , ]
plot( d.A0$rugged_std , d.A0$log_gdp_std , pch=1 , col="black" ,
xlab="ruggedness (standardized)" , ylab="log GDP (as proportion of mean)" ,
xlim=c(0,1) )
mu <- link( m8.3 , data=data.frame( cid=2 , rugged_std=rugged_seq ) )
mu_mean <- apply( mu , 2 , mean )
mu_ci <- apply( mu , 2 , PI , prob=0.97 )
lines( rugged_seq , mu_mean , lwd=2 )
shade( mu_ci , rugged_seq )
mtext("Non-African nations")
The core concept is easy to understand: When you condition on a collider, it creates statistical—but not necessarily causal— associations among its causes.
Collider bias arises from conditioning on a common consequence. If we can just get our graph sorted, we can avoid it. But it isn’t always so easy to see a potential collider, because there may be unmeasured causes. Unmeasured causes can still induce collider bias. So I’m sorry to say that we also have to consider the possibility that our DAG may be haunted.
We want to infer the direct influence of both parents (P) and grandparents (G) on the educational achievement of children (C). Since grandparents also presumably influence their own children’s education, there is an arrow G → P. This sounds pretty easy, so far. It’s similar in structure to our divorce rate example from last chapter:
But suppose there are unmeasured, common influences on parents and their children, such as neighborhoods, that are not shared by grandparents (who live on the south coast of Spain now). Then our DAG becomes haunted by the unobserved U:
Now P is a common consequence of G and U, so if we condition on P, it will bias inference about G → C, even if we never get to measure U. I don’t expect that fact to be immediately obvious. So let’s crawl through a quantitative example.
First, let’s simulate 200 triads of grandparents, parents, and children. This simulation will be simple. We’ll just project our DAG as a series of implied functional relationships. The DAG above implies that:
P is some function of G and U
C is some function of G, P, and U
G and U are not functions of any other known variables
We can make these implications into a simple simulation, using rnorm to generate simulated observations. But to do this, we need to be a bit more precise than “some function of.” So I’ll invent some strength of association:
N <- 200 # number of grandparent-parent-child triads
b_GP <- 1 # direct effect of G on P
b_GC <- 0 # direct effect of G on C
b_PC <- 1 # direct effect of P on C
b_U<-2 #direct effect of U on P and C
These parameters are like slopes in a regression model. Notice that I’ve assumed that grand- parents G have zero effect on their grandkids C. The example doesn’t depend upon that effect being exactly zero, but it will make the lesson clearer. Now we use these slopes to draw random observations:
set.seed(1)
U <- 2*rbern( N , 0.5 ) - 1
G <- rnorm( N )
P <- rnorm( N , b_GP*G + b_U*U )
C <- rnorm( N , b_PC*P + b_GC*G + b_U*U )
d <- data.frame( C=C , P=P , G=G , U=U )
I’ve made the neighborhood effect, U, binary. This will make the example easier to under- stand. But the example doesn’t depend upon that assumption. The other lines are just linear models embedded in rnorm.
Now what happens when we try to infer the influence of grandparents? Since some of the total effect of grandparents passes through parents, we realize we need to control for parents.
Here is a simple regression of C on P and G. Normally I would advise standardizing the variables, because it makes establishing sensible priors a lot easier. But I’m going to keep the simulated data on its original scale, so you can see what happens to inference about the slopes above. If we changed the scale, we shouldn’t expect to get those values back. But if we leave the scale alone, we should be able to recover something close to those values. So I apologize for using vague priors here, just to push forward in the example.
m6.11 <- quap( alist(
C ~ dnorm( mu , sigma ),
mu <- a + b_PC*P + b_GC*G,
a ~ dnorm( 0 , 1 ),
c(b_PC,b_GC) ~ dnorm( 0 , 1 ),
sigma ~ dexp( 1 )
), data=d )
precis(m6.11)
## mean sd 5.5% 94.5%
## a -0.1174752 0.09919574 -0.2760091 0.04105877
## b_PC 1.7868915 0.04455355 1.7156863 1.85809664
## b_GC -0.8389537 0.10614045 -1.0085867 -0.66932077
## sigma 1.4094891 0.07011139 1.2974375 1.52154063
The inferred effect of parents looks too big, almost twice as large as it should be. That isn’t surprising. Some of the correlation between P and C is due to U, and the model doesn’t know about U. That’s a simple confound. More surprising is that the model is confident that the direct effect of grandparents is to hurt their grandkids. The regression is not wrong. But a causal interpretation of that association would be.
Note that I did standardize the variables to make this plot. So the units on the axes are standard deviations. The horizontal axis is grandparent education. The vertical is grandchild education. There are two clouds of points. The blue cloud comprises children who live in good neighborhoods (U = 1). The black cloud comprises children who live in bad neighborhoods (U = −1). No- tice that both clouds of points show positive associations between G and C. More educated grandparents have more educated grandkids, but this effect arises entirely through parents. Why? Because we assumed it is so. The direct effect of G in the simulation is zero.
So how does the negative association arise, when we condition on parents? Conditioning on parents is like looking within sub-populations of parents with similar education. So let’s try that. In the Figure, I’ve highlighted in filled points those parents between the 45th and 60th centiles of education. There is nothing special of this range. It just makes the phenomenon easier to see. Now if we draw a regression line through only these points, regressing C on G, the slope is negative. There is the negative association that our multiple regression finds. But why does it exist?
It exists because, once we know P, learning G invisibly tells us about the neighborhood U, and U is associated with the outcome C. I know this is confusing. As I keep saying, if you are confused, it is only because you are paying attention.
So consider two different parents with the same education level, say for example at the median 50th centile. One of these parents has a highly educated grandparent. The other has a poorly educated grandparent. The only probable way, in this example, for these parents to have the same education is if they live in different types of neighborhoods. We can’t see these neighborhood effects—we haven’t measured them, recall—but the influence of neighborhood is still transmitted to the children C. So for our mythical two parents with the same education, the one with the highly educated grandparent ends up with a less well educated child. The one with the less educated grandparent ends up with the better educated child. G predicts lower C.
The unmeasured U makes P a collider, and conditioning on P produces collider bias. So what can we do about this? You have to measure U. Here’s the regression that conditions also on U:
m6.12 <- quap( alist(
C ~ dnorm( mu , sigma ),
mu <- a + b_PC*P + b_GC*G + b_U*U,
a ~ dnorm( 0 , 1 ),
c(b_PC,b_GC,b_U) ~ dnorm( 0 , 1 ),
sigma ~ dexp( 1 ) ), data=d )
precis(m6.12)
## mean sd 5.5% 94.5%
## a -0.12197510 0.07192588 -0.2369265 -0.007023655
## b_PC 1.01161103 0.06597258 0.9061741 1.117047948
## b_GC -0.04081373 0.09728716 -0.1962974 0.114669941
## b_U 1.99648992 0.14770462 1.7604294 2.232550439
## sigma 1.01959911 0.05080176 0.9384081 1.100790130
And those are the slopes we simulated with.
Rethinking: Statistical paradoxes and causal explanations. The grandparents example serves as an example of Simpson’s paradox: Including another predictor (P in this case) can reverse the direction of association between some other predictor (G) and the outcome (C). Usually, Simpson’s paradox is presented in cases where adding the new predictor helps us. But in this case, it misleads us. Simpson’s paradox is a statistical phenomenon. To know whether the reversal of the association correctly reflects causation, we need something more than just a statistical model.
From @mcelreath2020:
Multicollinearity means a very strong association between two or more predictor variables. The raw correlation isn’t what matters. Rather what matters is the association, conditional on the other variables in the model. The consequence of multicollinearity is that the posterior distribution will seem to suggest that none of the variables is reliably associated with the outcome, even if all of the variables are in reality strongly associated with the outcome.
The problem of multicollinearity is a member of a family of problems with fitting models, a family sometimes known as non-identifiability.
When a parameter is non-identifiable, it means that the structure of the data and model do not make it possible to estimate the parameter’s value. Sometimes this problem arises from mistakes in coding a model, but many important types of models present non-identifiable or weakly identifiable parameters, even when coded completely correctly. Nature does not owe us easy inference, even when the model is correct.
\(height \sim N(\mu_i, \sigma)\)
\(\mu_i = \alpha_{0} + \beta_{l} x_{leg\:left}+ \beta_{r} x_{leg\:right}\)
Imagine trying to predict an individual’s height using the length of his or her legs as predictor variables. Surely height is positively associated with leg length, or at least our simulation will assume it is. Nevertheless, once you put both legs (right and left) into the model, something vexing will happen.
The code below will simulate the heights and leg lengths of 100 individuals. For each, first a height is simulated from a Gaussian distribution. Then each individual gets a simulated proportion of height for their legs, ranging from 0.4 to 0.5. Finally, each leg is salted with a little measurement or developmental error, so the left and right legs are not exactly the same length, as is typical in real populations. At the end, the code puts height and the two leg lengths into a common data frame.
Now let’s analyze these data, predicting the outcome height with both predictors, leg_left and leg_right. Before approximating the posterior, however, consider what we expect. On average, an individual’s legs are 45% of their height (in these simulated data). So we should expect the beta coefficient that measures the association of a leg with height to end up around the average height (10) divided by 45% of the average height (4.5). This is 10/4.5 ≈ 2.2. Now let’s see what happens instead. I’ll use very vague, bad priors here, just so we can be sure that the priors aren’t responsible for what is about to happen.
N <- 100
set.seed(909)
height <- rnorm(N,10,2)
leg_prop <- runif(N,0.4,0.5)
leg_left <- leg_prop*height +
rnorm( N , 0 , 0.02 )
leg_right <- leg_prop*height +
rnorm( N , 0 , 0.02 )
d <- data.frame(height,leg_left,leg_right)
m6.1 <- quap(
alist(
height ~ dnorm( mu , sigma ) ,
mu <- a + bl*leg_left + br*leg_right ,
a ~ dnorm( 10 , 100 ) ,
bl ~ dnorm( 2 , 10 ) ,
br ~ dnorm( 2 , 10 ) ,
sigma ~ dexp( 1 )
),
data=d )
precis(m6.1)
## mean sd 5.5% 94.5%
## a 0.9812791 0.28395540 0.5274635 1.4350947
## bl 0.2118585 2.52703706 -3.8268348 4.2505518
## br 1.7836774 2.53125061 -2.2617500 5.8291047
## sigma 0.6171026 0.04343427 0.5476862 0.6865189
plot(precis(m6.1))
Those posterior means and standard deviations look crazy. Both legs have almost identical lengths, and height is so strongly associated with leg length, then why is this posterior distribution so weird? Did the posterior approximation work correctly?
The posterior distribution here is the right answer to the question we asked. The problem is the question. Recall that a multiple linear regression answers the question: What is the value of knowing each predictor, after already knowing all of the other predictors? So in this case, the question becomes: What is the value of knowing each leg’s length, after already knowing the other leg’s length?
The answer to this weird question is equally weird, but perfectly logical. The posterior distribution is the answer to this question, considering every possible combination of the parameters and assigning relative plausibilities to every combination, conditional on this model and these data. It might help to look at the joint posterior distribution for bl and br. The posterior distribution for these two parameters is very highly correlated, with all of the plausible values of bl and br lying along a narrow ridge. When bl is large, then br must be small. What has happened here is that since both leg variables contain almost exactly the same information, if you insist on including both in a model, then there will be a practically infinite number of combinations of bl and br that produce the same predictions.
post <- extract.samples(m6.1)
plot( bl ~ br , post , col=col.alpha(rangi2,0.1) , pch=16 )
One way to think of this phenomenon is that you have approximated this model:
\(height \sim N(\mu_i, \sigma)\)
\(\mu_i = \alpha_{0} + \beta_{1} x_{i}+ \beta_{2} x_{i}\)
The variable y is the outcome, like height in the example, and x is a single predictor, like the leg lengths in the example. Because the left and right legs are so correlated, it’s like using x twice. From the golem’s perspective, the model for μi is:
\(\mu_i = \alpha_{0} + (\beta_{l} + \beta_{r}) x\)
The parameters β1 and β2 cannot be pulled apart, because they never separately influence the mean μ. Only their sum, β1+β2, influences μ. So this means the posterior distribution ends up reporting the very large range of combinations of β1 and β2 that make their sum close to the actual association of x with y.
And the posterior distribution in this simulated example has done exactly that: It has produced a good estimate of the sum of bl and br. Here’s how you can compute the posterior distribution of their sum, and then plot it:
sum_blbr <- post$bl + post$br
dens( sum_blbr , col=rangi2 , lwd=2 , xlab="sum of bl and br" )
The posterior mean is in the right neighborhood, a little over 2, and the standard deviation is much smaller than it is for either component of the sum, bl or br. If you fit a regression with only one of the leg length variables, you’ll get approximately the same posterior mean:
m6.2 <- quap( alist(
height ~ dnorm( mu , sigma ) , mu <- a + bl*leg_left,
a ~ dnorm( 10 , 100 ) ,
bl ~ dnorm( 2 , 10 ) ,
sigma ~ dexp( 1 ) ) , data=d )
precis(m6.2)
## mean sd 5.5% 94.5%
## a 0.9979326 0.28364620 0.5446112 1.451254
## bl 1.9920676 0.06115704 1.8943269 2.089808
## sigma 0.6186038 0.04353998 0.5490185 0.688189
That 1.99 is almost identical to the mean value of sum_blbr.
The basic lesson is only this: When two predictor variables are very strongly correlated (conditional on other variables in the model), including both in a model may lead to confu- sion. The posterior distribution isn’t wrong, in such cases. It’s telling you that the question you asked cannot be answered with these data. And that’s a great thing for a model to say, that it cannot answer your question. And if you are just interested in prediction, you’ll find that this leg model makes fine predictions. It just doesn’t make any claims about which leg is more important.
In this example, we are concerned with the perc.fat (percent fat) and perc.lactose (per- cent lactose) variables. We’ll use these to model the total energy content, kcal.per.g. The code above has already standardized these three variables. You’re going to use these three variables to explore a natural case of multicollinearity.
library(rethinking)
data(milk)
d <- milk
d$K <- scale( d$kcal.per.g )
d$F <- scale( d$perc.fat )
d$L <- scale( d$perc.lactose )
Start by modeling kcal.per.g as a function of perc.fat and perc.lactose, but in two bivariate regressions.
# kcal.per.g regressed on perc.fat
m6.3 <- quap(
alist(
K ~ dnorm( mu , sigma ) ,
mu <- a + bF*F ,
a ~ dnorm( 0 , 0.2 ) ,
bF ~ dnorm( 0 , 0.5 ) ,
sigma ~ dexp( 1 )
) , data=d )
# kcal.per.g regressed on perc.lactose
m6.4 <- quap(
alist(
K ~ dnorm( mu , sigma ) ,
mu <- a + bL*L ,
a ~ dnorm( 0 , 0.2 ) ,
bL ~ dnorm( 0 , 0.5 ) ,
sigma ~ dexp( 1 )
) , data=d )
precis( m6.3 )
## mean sd 5.5% 94.5%
## a 1.535526e-07 0.07725195 -0.1234634 0.1234637
## bF 8.618970e-01 0.08426088 0.7272318 0.9965621
## sigma 4.510179e-01 0.05870756 0.3571919 0.5448440
precis( m6.4 )
## mean sd 5.5% 94.5%
## a 7.438895e-07 0.06661633 -0.1064650 0.1064665
## bL -9.024550e-01 0.07132848 -1.0164517 -0.7884583
## sigma 3.804653e-01 0.04958259 0.3012227 0.4597078
The posterior distributions for bF and bL are essentially mirror images of one another. The posterior mean of bF is as positive as the mean of bL is negative. Both are narrow posterior distributions that lie almost entirely on one side or the other of zero. Given the strong association of each predictor with the outcome, we might conclude that both variables are reliable predictors of total energy in milk, across species. The more fat, the more kilocalories in the milk. The more lactose, the fewer kilocalories in milk. But watch what happens when we place both predictor variables in the same regression model:
m6.5 <- quap(
alist(
K ~ dnorm( mu , sigma ) ,
mu <- a + bF*F + bL*L ,
a ~ dnorm( 0 , 0.2 ) ,
bF ~ dnorm( 0 , 0.5 ) ,
bL ~ dnorm( 0 , 0.5 ) ,
sigma ~ dexp( 1 )
),
data=d )
precis( m6.5 )
## mean sd 5.5% 94.5%
## a -3.172136e-07 0.06603577 -0.10553823 0.1055376
## bF 2.434983e-01 0.18357865 -0.04989579 0.5368925
## bL -6.780825e-01 0.18377670 -0.97179320 -0.3843719
## sigma 3.767418e-01 0.04918394 0.29813637 0.4553472
Now the posterior means of both bF and bL are closer to zero. And the standard deviations for both parameters are twice as large as in the bivariate models (m6.3 and m6.4).
This is the same statistical phenomenon as in the leg length example. What has happened is that the variables perc.fat and perc.lactose contain much of the same information. They are almost substitutes for one another. As a result, when you include both in a regression, the posterior distribution ends up describing a long ridge of combinations of bF and bL that are equally plausible. In the case of the fat and lactose, these two variables form essentially a single axis of variation. The easiest way to see this is to use a pairs plot:
pairs( ~ kcal.per.g + perc.fat + perc.lactose , data=d , col=rangi2 )
Percent fat is positively correlated with the outcome, while percent lactose is negatively correlated with it.
Now look at the right-most scatterplot in the middle row. This plot is the scatter of percent fat (vertical) against percent lactose (horizontal). Notice that the points line up almost entirely along a straight line. These two variables are negatively correlated, and so strongly so that they are nearly redundant. Either helps in predicting kcal.per.g, but neither helps as much once you already know the other.
In the scientific literature, you might encounter a variety of dodgy ways of coping with multicollinearity. Few of them take a causal perspective. Some fields actually teach students to inspect pairwise correlations before fitting a model, to identify and drop highly correlated predictors. This is a mistake. Pairwise correlations are not the problem. It is the conditional associations—not correlations—that matter. And even then, the right thing to do will de- pend upon what is causing the collinearity. The associations within the data alone are not enough to decide what to do.
What is likely going on in the milk example is that there is a core tradeoff in milk composition that mammal mothers must obey. If a species nurses often, then the milk tends to be watery and low in energy. Such milk is high in sugar (lactose). If instead a species nurses rarely, in short bouts, then the milk needs to be higher in energy. Such milk is very high in fat. This implies a causal model something like this:
The central tradeoff decides how dense, D, the milk needs to be. We haven’t observed this variable, so it’s shown circled. Then fat, F, and lactose, L, are determined. Finally, the com- position of F and L determines the kilocalories, K. If we could measure D, or had an evolutionary and economic model to predict it based upon other aspects of a species, that would be better than stumbling through regressions.
In general, there’s no guarantee that the available data contain much information about a parameter of interest. When that’s true, your Bayesian machine will return a posterior distribution very similar to the prior. Comparing the posterior to the prior can therefore be a good idea, a way of seeing how much information the model extracted from the data. When the posterior and prior are similar, it doesn’t mean the calculations are wrong—you got the right answer to the question you asked. But it might lead you to ask a better question.